
The Amazon Basin sees the largest amount of deforestation in the world. Better data on where deforestation and human activity occur allow governments and other stakeholders to respond more effectively.
The satellite data comes from Planet Labs, a company that designs CubeSats and operates the largest constellation of Earth-imaging satellites, capable of collecting daily images of the Earth at 3-5 meter resolution.
Traditional methods for tracking forest change usually depend on coarse-resolution imagery from Landsat or MODIS, which can have trouble differentiating human-caused from natural forest loss.

# imports
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Image
%matplotlib inline

# paths
IMG_PATH = '/Users/CJL/Documents/class_4360/amazon/'
TRAIN_LABELS = '/Users/CJL/Documents/class_4360/train_v2.csv'
IMG_EXT = '.jpg'

F1 Score
$F_1 = 2 \cdot \frac{1}{\tfrac{1}{\mathrm{recall}} + \tfrac{1}{\mathrm{precision}}} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
Beta F-Score
$F_\beta = \frac {(1 + \beta^2) \cdot \mathrm{true\ positive} }{(1 + \beta^2) \cdot \mathrm{true\ positive} + \beta^2 \cdot \mathrm{false\ negative} + \mathrm{false\ positive}}$
$F_\beta = (1 + \beta^2) \frac{pr}{\beta^2 p+r}\ \ \mathrm{where}\ \ p = \frac{tp}{tp+fp},\ \ r = \frac{tp}{tp+fn},\ \beta = 2.$
The beta parameter controls the relative weight of precision and recall in the combined score: beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> inf only recall). With beta = 2, recall is weighted more heavily than precision.
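As a quick sanity check of the formulas above (a minimal sketch; the toy labels below are made up for illustration), scikit-learn's `fbeta_score` computes the same quantity directly:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Toy binary labels: tp=3, fp=1, fn=1, so p = r = 3/4
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1])

# F_beta = (1 + beta^2) * p*r / (beta^2 * p + r) with beta = 2
p, r = 3 / 4, 3 / 4
manual = (1 + 2**2) * p * r / (2**2 * p + r)

print(fbeta_score(y_true, y_pred, beta=2), manual)  # both 0.75
```

With p = r the harmonic weighting collapses and every F-beta equals 0.75, which makes the hand check easy.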
image_name,tags
test_0,agriculture road water
test_1,primary clear
test_2,haze primary
etc.
df_train = pd.read_csv(TRAIN_LABELS)
df_train.head(10)
| | image_name | tags |
|---|---|---|
| 0 | train_0 | haze primary |
| 1 | train_1 | agriculture clear primary water |
| 2 | train_2 | clear primary |
| 3 | train_3 | clear primary |
| 4 | train_4 | agriculture clear habitation primary road |
| 5 | train_5 | haze primary water |
| 6 | train_6 | agriculture clear cultivation primary water |
| 7 | train_7 | haze primary |
| 8 | train_8 | agriculture clear cultivation primary |
| 9 | train_9 | agriculture clear cultivation primary road |


# Display 3x3 grids of sample images with their label rows as titles
new_style = {'grid': False}
plt.rc('axes', **new_style)

def show_image_grid(start, stop):
    _, ax = plt.subplots(3, 3, sharex='col', sharey='row', figsize=(20, 20))
    for i, v in enumerate(os.listdir(IMG_PATH)[start:stop]):
        plt.subplot(3, 3, i + 1)
        plt.imshow(plt.imread('{}{}'.format(IMG_PATH, v)))
        plt.title('{}'.format(df_train[df_train.image_name == v[:-4]]))

show_image_grid(67, 76)
show_image_grid(9, 18)
show_image_grid(87, 96)
`squareform` and `pdist` from `scipy.spatial.distance`
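A minimal sketch of those two helpers (the feature matrix below is random, standing in for real per-image features): `pdist` returns the condensed vector of pairwise distances, and `squareform` expands it into a symmetric distance matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
features = rng.random((5, 8))          # 5 items, 8 features each (made up)

condensed = pdist(features)            # the 5*4/2 = 10 pairwise distances
dist_matrix = squareform(condensed)    # 5x5 symmetric matrix, zero diagonal

print(condensed.shape, dist_matrix.shape)   # (10,) (5, 5)
```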




# XGBOOST: one binary classifier per label (one-vs-rest)
import numpy as np
import scipy
import xgboost as xgb
from tqdm import tqdm

# X, y, X_test, n_classes, and random_seed are assumed defined above
y_pred = np.zeros((X_test.shape[0], n_classes))
for class_i in tqdm(range(n_classes), miniters=1):
    model = xgb.XGBClassifier(max_depth=5,
                              learning_rate=0.1,
                              n_estimators=100,
                              silent=True,
                              objective='binary:logistic',
                              nthread=-1,
                              gamma=0,
                              min_child_weight=1,
                              max_delta_step=0,
                              subsample=1,
                              colsample_bytree=1,
                              colsample_bylevel=1,
                              reg_alpha=0,
                              reg_lambda=1,
                              scale_pos_weight=1,
                              base_score=0.5,
                              seed=random_seed,
                              missing=None)
    model.fit(X, y[:, class_i])
    y_pred[:, class_i] = model.predict_proba(X_test)[:, 1]
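The one-vs-rest loop above can be demonstrated end-to-end on synthetic data. In this sketch I substitute scikit-learn's `LogisticRegression` for `XGBClassifier` (an assumption for portability, in case `xgboost` or `tqdm` is unavailable); the structure of the loop is the same.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (made up): 200 samples, 10 features, 3 binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = (X[:, :3] > 0).astype(int)          # labels derived from the first 3 features
X_test = rng.normal(size=(50, 10))
n_classes = y.shape[1]

# Fit one binary classifier per label; collect P(label present) per column
y_pred = np.zeros((X_test.shape[0], n_classes))
for class_i in range(n_classes):
    clf = LogisticRegression().fit(X, y[:, class_i])
    y_pred[:, class_i] = clf.predict_proba(X_test)[:, 1]

print(y_pred.shape)   # (50, 3)
```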
# 3 layer CNN
import cv2
from keras.models import Sequential
from keras.layers import *
from keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
# 17 sigmoid outputs: one independent probability per tag (multi-label)
model.add(Dense(17, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=128, epochs=50, verbose=1,
          validation_data=(x_valid, y_valid))
# Epoch 50/50
# 20000/20000 [==============================] - 11s -
# loss: 0.0769 - acc: 0.9658 - val_loss: 0.2194 - val_acc: 0.9400
from sklearn.metrics import fbeta_score
p_valid = model.predict(x_valid, batch_size=128)
print(fbeta_score(y_valid, np.array(p_valid) > 0.2, beta=2, average='samples'))
# F-score
# 0.854775001503
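The 0.2 decision threshold above is a single global cut-off. A common refinement (my suggestion, not something the original notebook does) is to search a per-class threshold that maximizes F2 on the validation set. A small numpy-only sketch with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Fake validation labels and probabilities (illustration only)
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=(100, 4))
p_valid = np.clip(y_valid * 0.6 + rng.random((100, 4)) * 0.5, 0, 1)

# Grid-search one threshold per class that maximizes per-class F2
thresholds = np.full(y_valid.shape[1], 0.2)
candidates = np.arange(0.05, 0.95, 0.05)
for j in range(y_valid.shape[1]):
    scores = [fbeta_score(y_valid[:, j], (p_valid[:, j] > t).astype(int), beta=2)
              for t in candidates]
    thresholds[j] = candidates[int(np.argmax(scores))]

tuned = fbeta_score(y_valid, (p_valid > thresholds).astype(int), beta=2, average='samples')
base = fbeta_score(y_valid, (p_valid > 0.2).astype(int), beta=2, average='samples')
print(base, tuned)
```

On real predictions the gain is usually modest but measurable, since rare classes often need thresholds well below a global cut-off.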
train_df = pd.read_csv(TRAIN_LABELS)
# Collect the unique tags across all space-separated tag lists
tagnames = list(set(tag for taglist in train_df['tags'].str.split() for tag in taglist))
print('Tags: ' + ', '.join(tagnames))

def containsTag(tag, taglist):
    return 1 if tag in taglist.split() else 0

# One binary indicator column per tag
for tag in tagnames:
    train_df[tag] = train_df['tags'].apply(lambda taglist: containsTag(tag, taglist))
train_df.head()
Tags: partly_cloudy, haze, primary, cloudy, agriculture, bare_ground, selective_logging, road, cultivation, slash_burn, conventional_mine, water, clear, artisinal_mine, blooming, blow_down, habitation
| | image_name | tags | partly_cloudy | haze | primary | cloudy | agriculture | bare_ground | selective_logging | road | cultivation | slash_burn | conventional_mine | water | clear | artisinal_mine | blooming | blow_down | habitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_0 | haze primary | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | train_1 | agriculture clear primary water | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | train_2 | clear primary | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | train_3 | clear primary | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | train_4 | agriculture clear habitation primary road | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
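For reference, pandas can build the same per-tag indicator columns in one call, equivalent to the `containsTag` loop above (the two rows below are a small made-up sample):

```python
import pandas as pd

df = pd.DataFrame({'image_name': ['train_0', 'train_1'],
                   'tags': ['haze primary', 'agriculture clear primary water']})

# One 0/1 column per unique space-separated tag
dummies = df['tags'].str.get_dummies(sep=' ')
out = df.join(dummies)

print(sorted(dummies.columns))   # ['agriculture', 'clear', 'haze', 'primary', 'water']
```

This is both shorter and faster than an `apply` per tag, at the cost of less control over column order.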
train_df[train_df[['clear', 'haze', 'partly_cloudy', 'cloudy']].sum(axis=1) != 1]
| | image_name | tags | partly_cloudy | haze | primary | cloudy | agriculture | bare_ground | selective_logging | road | cultivation | slash_burn | conventional_mine | water | clear | artisinal_mine | blooming | blow_down | habitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24448 | train_24448 | water | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
# train_24448 is the only row with no atmospheric-condition tag; mark it clear
train_df.loc[24448, 'clear'] = 1
train_df.loc[24448, 'tags'] = 'clear water'
import seaborn as sns

# Images by atmospheric condition
ConditionCounts = train_df[['clear', 'haze', 'partly_cloudy', 'cloudy']].sum().sort_values(ascending=False)
ConditionCounts = ConditionCounts.reset_index(name='Count')
ax = sns.barplot(x='index', y='Count', data=ConditionCounts)
ax.set(xlabel='Condition Type', ylabel='Count', title='Images by Condition')
ax.grid(True, axis='y', ls=':')

# Images by land-cover / land-use type
typeTags = ['artisinal_mine', 'blow_down', 'water', 'cultivation', 'road', 'agriculture',
            'bare_ground', 'blooming', 'selective_logging', 'habitation', 'conventional_mine',
            'slash_burn', 'primary']
TypeCounts = train_df[train_df.columns[train_df.columns.isin(typeTags)]].sum().sort_values(ascending=False)
TypeCounts = TypeCounts.reset_index(name='Count')
plt.figure(figsize=(12, 4))
ax = sns.barplot(x='index', y='Count', data=TypeCounts)
ax.set(xlabel='Type', ylabel='Count', title='Images by Type')
ax.grid(True, axis='y', ls=':')
# Rotate the tick-labels
for ticklabel in ax.get_xticklabels():
    ticklabel.set_rotation(45)

# Focusing on the rare types:
RareTypeCounts = train_df[['artisinal_mine', 'blow_down', 'bare_ground', 'blooming',
                           'selective_logging', 'conventional_mine',
                           'slash_burn']].sum().sort_values(ascending=False)
RareTypeCounts = RareTypeCounts.reset_index(name='Count')
plt.figure(figsize=(8, 4))
ax = sns.barplot(x='index', y='Count', data=RareTypeCounts)
ax.set(xlabel='Type', ylabel='Count', title='Images by Less Common Types')
ax.grid(True, axis='y', ls=':')
for ticklabel in ax.get_xticklabels():
    ticklabel.set_rotation(45)

# Type distribution conditioned on each (non-cloudy) condition
for condition in ['clear', 'haze', 'partly_cloudy']:
    _ConditionalTypeCounts = train_df[train_df.columns[train_df.columns.isin(typeTags)]][train_df[condition] == 1].sum().sort_values(ascending=False)
    _ConditionalTypeCounts = _ConditionalTypeCounts.reset_index(name='Count')
    plt.figure(figsize=(12, 4))
    ax = sns.barplot(x='index', y='Count', data=_ConditionalTypeCounts)
    ax.set(xlabel='Type', ylabel='Count', title='Images by Type for Condition %s' % condition)
    ax.grid(True, axis='y', ls=':')
    for ticklabel in ax.get_xticklabels():
        ticklabel.set_rotation(45)

pd.set_option("display.max_columns", 100)
color_df = pd.read_csv('images/ColorStatsDF.csv')
color_df.head()
| | Unnamed: 0 | r_mean | r_median | r_std | r_max | r_min | r_kurtosis | r_skewness | g_mean | g_median | g_std | g_max | g_min | g_kurtosis | g_skewness | b_mean | b_median | b_std | b_max | b_min | b_kurtosis | b_skewness | n_mean | n_median | n_std | n_max | n_min | n_kurtosis | n_skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2823.257050 | 2819 | 85.647217 | 3396 | 2469 | 0.965014 | 0.387792 | 4369.526978 | 4367 | 83.879425 | 4856 | 4034 | 0.547369 | 0.233650 | 5226.039261 | 5224 | 91.665121 | 5794 | 4860 | 0.597669 | 0.165912 | 6126.069626 | 6122 | 406.868986 | 8527 | 4559 | 0.295068 | 0.112706 |
| 1 | 1 | 2477.354996 | 2336 | 383.497120 | 5510 | 1809 | 4.923805 | 1.713841 | 3666.500900 | 3510 | 421.583711 | 6118 | 2836 | 0.554345 | 0.996892 | 4312.921616 | 4246 | 273.744368 | 6697 | 3637 | 4.451386 | 1.410764 | 7140.713974 | 7086 | 1295.642057 | 11467 | 3128 | -0.383098 | 0.041964 |
| 2 | 2 | 2030.450165 | 2031 | 112.283244 | 2752 | 1613 | 0.298555 | 0.058924 | 3214.830826 | 3217 | 112.536970 | 3820 | 2788 | 0.101043 | -0.037807 | 3952.491165 | 3952 | 96.707288 | 4537 | 3572 | 0.362588 | 0.075739 | 5774.253342 | 5790 | 724.758868 | 8783 | 2964 | 0.039722 | -0.094088 |
| 3 | 3 | 2479.578049 | 2478 | 111.271727 | 3384 | 2090 | 1.189625 | 0.358904 | 3699.635529 | 3700 | 115.832992 | 4386 | 3239 | 0.342273 | 0.147777 | 4602.361969 | 4601 | 107.109551 | 5368 | 4095 | 0.765090 | 0.233627 | 5577.092361 | 5598 | 630.781636 | 8278 | 3434 | 0.048179 | -0.092494 |
| 4 | 4 | 3381.794922 | 3228 | 611.183348 | 8479 | 2348 | 6.517757 | 2.443548 | 4433.354370 | 4295 | 575.751792 | 9518 | 3352 | 5.582191 | 2.198323 | 5082.952789 | 5000 | 407.905704 | 10492 | 4210 | 9.057131 | 2.325429 | 8818.953049 | 8713 | 1314.023509 | 13208 | 4887 | -0.076711 | 0.422574 |
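A sketch of how per-channel statistics like those above could be computed (the fourth `n` channel is near-infrared; the array below is random, standing in for a real 4-band image):

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
img = rng.integers(0, 10000, size=(256, 256, 4))   # fake 4-band image (r, g, b, nir)

# Seven summary statistics per band -> 28 features per image
stats = {}
for band, name in enumerate(['r', 'g', 'b', 'n']):
    ch = img[:, :, band].ravel()
    stats[name + '_mean'] = ch.mean()
    stats[name + '_median'] = np.median(ch)
    stats[name + '_std'] = ch.std()
    stats[name + '_max'] = ch.max()
    stats[name + '_min'] = ch.min()
    stats[name + '_kurtosis'] = kurtosis(ch)
    stats[name + '_skewness'] = skew(ch)

print(len(stats))   # 28
```

Running this per image and stacking the dicts into a DataFrame yields a table shaped like `ColorStatsDF.csv`.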



We used 5-fold cross-validation, so each model has 5 sets of trained weights. With 30 CNN models in total, that comes to 150 weight files.
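A minimal sketch of that scheme with scikit-learn's `KFold` (the data here is a placeholder; the actual model training and weight saving from the original setup are not reproduced):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)      # placeholder data: 50 samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

folds = list(kf.split(X))
print(len(folds))                      # 5 train/validation index splits
for train_idx, val_idx in folds:
    # one set of model weights would be trained and saved per fold here
    pass
```

Each of the 30 models repeats this loop once, producing 5 weight files apiece.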

USGS Guide on Landsat: https://landsat.usgs.gov/sites/default/files/documents/si_product_guide.pdf
Cloud detection: https://weather.msfc.nasa.gov/sport/journal/pdfs/2009_GRS_Jedlovec.pdf
Water Detection: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4970121/
NDWI: https://en.wikipedia.org/wiki/Normalized_difference_water_index